Skip to content

Add tests and fix corner cases for Parquet/GeoArrow extension type conversion#10065

Open
paleolimbot wants to merge 5 commits into
apache:mainfrom
paleolimbot:parquet-geospatial-typo
Open

Add tests and fix corner cases for Parquet/GeoArrow extension type conversion#10065
paleolimbot wants to merge 5 commits into
apache:mainfrom
paleolimbot:parquet-geospatial-typo

Conversation

@paleolimbot

@paleolimbot paleolimbot commented Jun 3, 2026

Copy link
Copy Markdown
Member

Which issue does this PR close?

Rationale for this change

There were several issues with conversion identified when I tried to integrate this in SedonaDB and that came to light when the spec was recently clarified.

I am sorry for missing these changes when I reviewed the initial implementation.

What changes are included in this PR?

  • A Parquet crs of None for geometry or geography is now converted to a GeoArrow CRS of "OGC:CRS84" (the named value for the default CRS in the Parquet spec)
  • A Parquet crs of "srid:0" is now converted to a GeoArrow "omitted" CRS. This was recently clarified in the Parquet spec (srid:0 is a named example in the list of allowed values).
  • A GeoArrow missing CRS is now encoded as "srid:0"
  • A GeoArrow CRS that is "lonlat-like" is now encoded as a Parquet crs of None. This logic was included in the previous implementation but was reversed (Parquet CRSes that looked like lonlat were omitted when written to GeoArrow, which is not correct).
  • The GeoArrow metadata struct uses the name "algorithm" and was serializing it to JSON. The GeoArrow spec uses the "edges" key. This led to invalid metadata being generated which was either rejected or incorrectly interpreted by consumers.

Are these changes tested?

Yes. I added high-level end-to-end LogicalType <-> extension metadata tests, since that is what matters (there were a few lower level tests that I updated as well).

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 3, 2026
Comment on lines +437 to +495
// Geometry with default CRS (defaults to OGC:CRS84 per Parquet spec)
(LogicalType::geometry(None), r#"{"crs":"OGC:CRS84"}"#),
// Geometry with srid:0 should result in an unset (omitted) CRS
(LogicalType::geometry(Some("srid:0".to_string())), r#"{}"#),
// Geometry with custom CRSes (authority:code and partial projjson)
(
LogicalType::geometry(Some("EPSG:4267".to_string())),
r#"{"crs":"EPSG:4267"}"#,
),
(
LogicalType::geometry(Some(
r#"{"id":{"authority":"EPSG","code":4326}}"#.to_string(),
)),
r#"{"crs":{"id":{"authority":"EPSG","code":4326}}}"#,
),
// Geography with default CRS (default OGC:CRS84, spherical edges)
(
LogicalType::geography(None, None),
r#"{"crs":"OGC:CRS84","edges":"spherical"}"#,
),
// Geography with explicit edges
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::SPHERICAL)),
r#"{"crs":"OGC:CRS84","edges":"spherical"}"#,
),
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::KARNEY)),
r#"{"crs":"OGC:CRS84","edges":"karney"}"#,
),
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::VINCENTY)),
r#"{"crs":"OGC:CRS84","edges":"vincenty"}"#,
),
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::ANDOYER)),
r#"{"crs":"OGC:CRS84","edges":"andoyer"}"#,
),
(
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::THOMAS)),
r#"{"crs":"OGC:CRS84","edges":"thomas"}"#,
),
// Geometry with srid:0 should result in an unset (omitted) CRS
// and spherical edges
(
LogicalType::geography(Some("srid:0".to_string()), None),
r#"{"edges":"spherical"}"#,
),
// Geography with custom CRSes (authority:code and partial projjson)
(
LogicalType::geography(Some("EPSG:4267".to_string()), None),
r#"{"crs":"EPSG:4267","edges":"spherical"}"#,
),
(
LogicalType::geography(
Some(r#"{"id":{"authority":"EPSG","code":4326}}"#.to_string()),
None,
),
r#"{"crs":{"id":{"authority":"EPSG","code":4326}},"edges":"spherical"}"#,
),

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the test cases for reading Parquet files. For GeoArrow implementation to correctly recognize the intent of the CRS field in the Parquet file, this is the GeoArrow extension metadata that must be produced in each of these cases.

There were a few problems with the existing implementation:

  • LogicalType::geometry(None) (i.e., Parquet default CRS) was read as as {} (which in GeoArrow land means "I don't know what the CRS is)
  • LogicalType::geography(Some(...), None) (i.e. Parquet default edge algorithm) was read as {"crs":"..."} (i.e., no edges in the GeoArrow metadata, which consumers would recognize as GEOMETRY and not GEOGRAPHY)
  • LogicalType::geography(Some(...), Some(EdgeInterpolationAlgorithm::SPHERICAL)) (i.e. an actual edge algorithm) was read as {"crs":"...","algorithm":"spherical"}. The "algorithm" key isn't valid in GeoArrow and this would be rejected by consumers or ignored and read as GEOMETRY (not geography).

Comment on lines +540 to +612
// Geometry with no CRS should be GEOMETRY(srid:0)
(r#"{}"#, LogicalType::geometry(Some("srid:0".to_string()))),
// Geometry with string CRS
(
r#"{"crs":"EPSG:4267"}"#,
LogicalType::geometry(Some("\"EPSG:4267\"".to_string())),
),
// Geometry with PROJJSON CRS
(
r#"{"crs":{"id":{"authority":"EPSG","code":3857}}}"#,
LogicalType::geometry(Some(
r#"{"id":{"authority":"EPSG","code":3857}}"#.to_string(),
)),
),
// Geometry with lon/lat CRSes (canonically removed because lon/lat is the
// default Parquet CRS)
(r#"{"crs":"OGC:CRS84"}"#, LogicalType::geometry(None)),
(r#"{"crs":"EPSG:4326"}"#, LogicalType::geometry(None)),
(
r#"{"crs":{"id":{"authority":"EPSG","code":4326}}}"#,
LogicalType::geometry(None),
),
(
r#"{"crs":{"id":{"authority":"EPSG","code":"4326"}}}"#,
LogicalType::geometry(None),
),
(
r#"{"crs":{"id":{"authority":"OGC","code":"CRS84"}}}"#,
LogicalType::geometry(None),
),
// Geography with no CRS, spherical edges
(
r#"{"edges":"spherical"}"#,
LogicalType::geography(Some("srid:0".to_string()), None),
),
// Geography with OGC:CRS84 and spherical edges
(
r#"{"crs":"OGC:CRS84","edges":"spherical"}"#,
LogicalType::geography(None, None),
),
// Geography with different edge algorithms
(
r#"{"crs":"OGC:CRS84","edges":"karney"}"#,
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::KARNEY)),
),
(
r#"{"crs":"OGC:CRS84","edges":"vincenty"}"#,
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::VINCENTY)),
),
(
r#"{"crs":"OGC:CRS84","edges":"andoyer"}"#,
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::ANDOYER)),
),
(
r#"{"crs":"OGC:CRS84","edges":"thomas"}"#,
LogicalType::geography(None, Some(EdgeInterpolationAlgorithm::THOMAS)),
),
// Geography with custom CRS and edges
(
r#"{"crs":"EPSG:4267","edges":"karney"}"#,
LogicalType::geography(
Some("\"EPSG:4267\"".to_string()),
Some(EdgeInterpolationAlgorithm::KARNEY),
),
),
// Geography with PROJJSON CRS
(
r#"{"crs":{"id":{"authority":"EPSG","code":4267}},"edges":"spherical"}"#,
LogicalType::geography(
Some(r#"{"id":{"authority":"EPSG","code":4267}}"#.to_string()),
None,
),
),

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the tests for writing. The main issue with writing was that "edges":"spherical", which GeoArrow uses to communicate GEOGRAPHY, was not actually writing a geography type because the implementation expected "algorithm". An unset CRS (geoarrow metadata {}) was also written as a Parquet default CRS which is technically incorrect but also not common.

@paleolimbot paleolimbot marked this pull request as ready for review June 5, 2026 02:18
@paleolimbot

Copy link
Copy Markdown
Member Author

cc @BlakeOrth

@BlakeOrth

Copy link
Copy Markdown
Contributor

Thanks for the ping, I'll make some time to review this today.

@BlakeOrth BlakeOrth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here look good to me, I didn't find anything meaningful that would require any follow-up. Thanks for taking care of this!

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @paleolimbot This all looks reasonable to me, but I am not enough of an expert to understand what the downstream implications of this are

@kylebarron I wonder if you might have some time to help review this PR and lend your specialized geoarrow knowledge ?

@paleolimbot do you have any sense

  1. Is this a "breaking change" (as in could we put it out in a minor version or should we wait for the next major version)?
  2. DO you have any suggestions on how to review this PR? Like other implementations I can cross reference with or links to the spec, etc?

// Geometry with string CRS
(
r#"{"crs":"EPSG:4267"}"#,
LogicalType::geometry(Some("\"EPSG:4267\"".to_string())),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it right to have embedded quotes here? It seems strange (maybe it is because it is a JSON string and thus needs to be a valid JSON value)?

                LogicalType::geometry(Some("EPSG:4267".to_string())),

The same thing can be seen below too

@kylebarron

Copy link
Copy Markdown
Member

Sorry, I haven't been following the latest updates in the Parquet native geometry type spec 🥲

@kylebarron kylebarron left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good but I didn't take a super close look

@paleolimbot

Copy link
Copy Markdown
Member Author

Is this a "breaking change"

Definitely not in the "Rust code will stop compiling" sense. From a metadata semantics point of view I would consider this a bug fix...the currently emitted metadata would have either triggered errors or resulted in silently/subtly incorrect results in at least one common case (the default Parquet != GeoArrow default CRS), and intentionally invalid metadata is required to write geography. In order to have any of this code triggered one would need to build parquet with the geospatial feature enabled, which I am not sure is common (but I don't know this).

That said there is a workaround if this needs to be punted (in SedonaDB we have a UDF that intentionally creates invalid GeoArrow field metadata to write valid Parquet output).

DO you have any suggestions on how to review this PR?

I hope the below is helpful!

other implementations I can cross reference with

Arrow C++ would be the other one: https://github.com/apache/arrow/blob/bfc9cdb2cc8e1dcbfef5e97db2e3e9f5a15fd356/cpp/src/parquet/geospatial/util_json_internal.cc#L35-L142

The only difference here is the srid:0 conversion. That was only just clarified in the Parquet spec in the last few weeks and I haven't had a chance to PR it into that spec yet.

Like other implementations I can cross reference with

The two CRS specs:

Is it right to have embedded quotes here?

Good catch! I'll update. It works because we do try to parse the value as JSON first (to avoid taking a valid JSON object from the Parquet field, which is a string, escaping it, and re-inserting it into a the JSON GeoArrow metadata) but it's not what I was trying to test 😬

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet geospatial conversion uses metadata key "algorithm" instead of "edges" in geoarrow metadata

4 participants